Random Forests with Missing Values in the Covariates

Authors

  • Anna Rieger
  • Torsten Hothorn
  • Carolin Strobl
Abstract

In Random Forests [2], several trees are constructed from bootstrap samples or subsamples of the original data. Random Forests have become very popular, e.g., in the fields of genetics and bioinformatics, because they can deal with high-dimensional problems including complex interaction effects. Conditional Inference Forests [8] provide an implementation of Random Forests with unbiased variable selection. Like the original Random Forests, they employ surrogate variables to handle missing values in the predictor variables. In this paper we report the results of an extensive simulation study covering both classification and regression problems under a variety of scenarios, including different missing-value generating processes as well as different correlation structures between the variables. Moreover, a high-dimensional setting with a large number of noise variables was considered in each case. The study compares the performance of Conditional Inference Forests with surrogate variables to that of Conditional Inference Forests fitted after k-nearest-neighbor (kNN) imputation. The results show that, while in some settings one or the other approach is slightly superior, there is no overall difference in performance between the two approaches.
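The kNN imputation step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `knn_impute`, the distance (root mean squared difference over the features observed in both rows), and the toy data are all assumptions chosen for illustration. Each missing entry is filled with the mean of that feature over the k nearest rows in which it is observed, after which any forest implementation could be fitted to the completed data.

```python
import math


def knn_impute(rows, k=3):
    """Fill None entries with the mean of that feature over the k nearest rows.

    Hypothetical helper for illustration only. Distance between two rows is the
    root mean squared difference over the features observed in both of them.
    """
    imputed = [list(r) for r in rows]  # copy so the input stays untouched
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            candidates = []
            for m, other in enumerate(rows):
                # A donor row must be a different row with feature j observed.
                if m == i or other[j] is None:
                    continue
                shared = [
                    (a - b) ** 2
                    for a, b in zip(row, other)
                    if a is not None and b is not None
                ]
                if not shared:
                    continue  # no commonly observed features, distance undefined
                distance = math.sqrt(sum(shared) / len(shared))
                candidates.append((distance, other[j]))
            if candidates:
                candidates.sort(key=lambda c: c[0])
                nearest = candidates[:k]
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


# Toy data (assumed): the second row is missing its middle feature.
data = [
    [1.0, 2.0, 3.0],
    [1.1, None, 3.1],
    [5.0, 6.0, 7.0],
    [0.9, 2.2, 2.9],
]
filled = knn_impute(data, k=2)
```

In this toy example the two nearest donors for the incomplete row are the first and fourth rows, so the missing entry is replaced by the mean of their middle values. The surrogate-variable approach, by contrast, handles missing values inside the tree-growing procedure and needs no separate preprocessing step.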

Similar resources

Imputation of Missing Values for Unsupervised Data Using the Proximity in Random Forests

This paper presents a new procedure that imputes missing values by random forests for unsupervised data. We found that it works pretty well compared with k-nearest neighbor (kNN) and rough imputations replacing the median of the variables. Moreover, this procedure can be expanded to semisupervised data sets. The rate of the correct classification is higher than that of other conventional method...

Comparison of Bayesian and Classical Methods for Estimating the Parameters of the Logistic Regression Model with Missing Values in the Covariates

Background and Aim: Logistic regression is an analytic tool widely used in medical and epidemiologic research. In many studies, we face data sets in which some of the data are not recorded. A simple way to deal with such "missing data" is to simply ignore the subjects with missing observations, and perform the analysis on cases for which complete data are available. Materials and Methods: We c...

Evaluation of Imputation of Covariates in an Impact Analysis With Regression Adjustment

In an impact analysis using random assignment, researchers often deal with missing values in both the covariates and the outcome variables of regression models. Clearly rigorous methods are needed to impute missing values in the outcome variables to minimize the potential bias in impact assessments. When imputation is applied to covariates of the regression analyses, the effect of imputation is...

A comparison of the conditional inference survival forest model to random survival forests based on a simulation study as well as on two applications with time-to-event data

BACKGROUND Random survival forest (RSF) models have been identified as alternative methods to the Cox proportional hazards model in analysing time-to-event data. These methods, however, have been criticised for the bias that results from favouring covariates with many split-points and hence conditional inference forests for time-to-event data have been suggested. Conditional inference forests (...

Publication date: 2010